Project 2

import plotly.io as pio

pio.renderers.default = "vscode+jupyterlab+notebook_connected"

Deforestation and CO2 Emission

Deforestation has been increasing around the world, driven largely by industrialization, and it contributes significantly to climate change. This issue caught my attention, and I want to study how deforestation relates to CO2 emissions. Understanding this connection helps clarify how deforestation acts as a driver of climate change.

Research Focus

  1. Research Question:
    How does deforestation correlate with CO2 emissions globally from 2001 to 2019?

  2. Hypothesis:
    Higher rates of deforestation contribute to increased CO2 emissions, with significant regional differences.

Data Sources

  1. Forest Dataset:
    Sourced from Global Forest Watch, an organization dedicated to providing real-time data and tools for monitoring forests worldwide.

  2. CO2 Emission Dataset:
    Obtained from Kaggle, which offers a dataset on CO2 emissions, growth, and population by country.

Step 1 - Import Library and Data Management

import pandas as pd
import plotly.express as px
forest = pd.read_csv("treecover_loss__ha_1.csv")
forest
iso umd_tree_cover_loss__year umd_tree_cover_loss__ha
0 AFG 2001 88.092712
1 AGO 2001 101220.621525
2 AIA 2001 3.878461
3 ALA 2001 396.934826
4 ALB 2001 3729.021031
... ... ... ...
4566 XKO 2023 1465.438575
4567 XNC 2023 41.029104
4568 ZAF 2023 29571.219239
4569 ZMB 2023 190416.586825
4570 ZWE 2023 5690.371581

4571 rows × 3 columns

forest.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4571 entries, 0 to 4570
Data columns (total 3 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   iso                        4571 non-null   object 
 1   umd_tree_cover_loss__year  4571 non-null   int64  
 2   umd_tree_cover_loss__ha    4571 non-null   float64
dtypes: float64(1), int64(1), object(1)
memory usage: 107.3+ KB

Step 2 - Data Preprocessing

  • The forest dataset from Global Forest Watch includes a column named iso, which contains country codes following the International Organization for Standardization (ISO) standard. However, it does not include country names or additional specifications.

  • To address this, we need to identify and load an additional dataset containing ISO codes alongside country names and specifications. This will allow us to link the forest data to corresponding countries for further analysis.

  • Based on the output of the info() function, we can confirm that the data types for each column are appropriately formatted for analysis.

  • To improve readability, we need to rename the column umd_tree_cover_loss__year to year.

forest.rename(columns={
    'umd_tree_cover_loss__year': 'year',
}, inplace=True)
forest
iso year umd_tree_cover_loss__ha
0 AFG 2001 88.092712
1 AGO 2001 101220.621525
2 AIA 2001 3.878461
3 ALA 2001 396.934826
4 ALB 2001 3729.021031
... ... ... ...
4566 XKO 2023 1465.438575
4567 XNC 2023 41.029104
4568 ZAF 2023 29571.219239
4569 ZMB 2023 190416.586825
4570 ZWE 2023 5690.371581

4571 rows × 3 columns

Step 3 - Load the Continents Data

continents = pd.read_csv('continents2.csv')
continents
name alpha-2 alpha-3 country-code iso_3166-2 region sub-region intermediate-region region-code sub-region-code intermediate-region-code
0 Afghanistan AF AFG 4 ISO 3166-2:AF Asia Southern Asia NaN 142.0 34.0 NaN
1 Åland Islands AX ALA 248 ISO 3166-2:AX Europe Northern Europe NaN 150.0 154.0 NaN
2 Albania AL ALB 8 ISO 3166-2:AL Europe Southern Europe NaN 150.0 39.0 NaN
3 Algeria DZ DZA 12 ISO 3166-2:DZ Africa Northern Africa NaN 2.0 15.0 NaN
4 American Samoa AS ASM 16 ISO 3166-2:AS Oceania Polynesia NaN 9.0 61.0 NaN
... ... ... ... ... ... ... ... ... ... ... ...
244 Wallis and Futuna WF WLF 876 ISO 3166-2:WF Oceania Polynesia NaN 9.0 61.0 NaN
245 Western Sahara EH ESH 732 ISO 3166-2:EH Africa Northern Africa NaN 2.0 15.0 NaN
246 Yemen YE YEM 887 ISO 3166-2:YE Asia Western Asia NaN 142.0 145.0 NaN
247 Zambia ZM ZMB 894 ISO 3166-2:ZM Africa Sub-Saharan Africa Eastern Africa 2.0 202.0 14.0
248 Zimbabwe ZW ZWE 716 ISO 3166-2:ZW Africa Sub-Saharan Africa Eastern Africa 2.0 202.0 14.0

249 rows × 11 columns

  • After loading the continent data, we can drop unnecessary columns and retain only the country (stored as name), region, and sub-region information based on the iso codes. Additionally, we need to rename the name column to country to ensure readability.

continents_region = continents[['alpha-3', 'name', 'region', 'sub-region']].copy()
continents_region.rename(columns={'region': 'continent_region',
                                  'name': 'country',
                                  'sub-region': 'continent_sub_region'}, inplace=True)
continents_region
alpha-3 country continent_region continent_sub_region
0 AFG Afghanistan Asia Southern Asia
1 ALA Åland Islands Europe Northern Europe
2 ALB Albania Europe Southern Europe
3 DZA Algeria Africa Northern Africa
4 ASM American Samoa Oceania Polynesia
... ... ... ... ...
244 WLF Wallis and Futuna Oceania Polynesia
245 ESH Western Sahara Africa Northern Africa
246 YEM Yemen Asia Western Asia
247 ZMB Zambia Africa Sub-Saharan Africa
248 ZWE Zimbabwe Africa Sub-Saharan Africa

249 rows × 4 columns

Step 4 - Data Merging and Further Cleaning

We are now ready to merge the deforestation dataset with the continent data to include detailed regional information.

forest_region = pd.merge(
    forest,
    continents_region,
    left_on='iso',
    right_on='alpha-3',
    how='left'
)

forest_region.drop(columns=['alpha-3'], inplace=True)
forest_region
iso year umd_tree_cover_loss__ha country continent_region continent_sub_region
0 AFG 2001 88.092712 Afghanistan Asia Southern Asia
1 AGO 2001 101220.621525 Angola Africa Sub-Saharan Africa
2 AIA 2001 3.878461 Anguilla Americas Latin America and the Caribbean
3 ALA 2001 396.934826 Åland Islands Europe Northern Europe
4 ALB 2001 3729.021031 Albania Europe Southern Europe
... ... ... ... ... ... ...
4566 XKO 2023 1465.438575 NaN NaN NaN
4567 XNC 2023 41.029104 NaN NaN NaN
4568 ZAF 2023 29571.219239 South Africa Africa Sub-Saharan Africa
4569 ZMB 2023 190416.586825 Zambia Africa Sub-Saharan Africa
4570 ZWE 2023 5690.371581 Zimbabwe Africa Sub-Saharan Africa

4571 rows × 6 columns

As observed, there are still several ISO codes that are not included in our region dataset. To address this, we need to identify these missing codes. If possible, we should supplement the dataset by filling in the appropriate country names and regional information from other reliable sources.

nan_data = forest_region[forest_region['continent_region'].isna()]
nan_data
iso year umd_tree_cover_loss__ha country continent_region continent_sub_region
200 XAD 2001 1.648989 NaN NaN NaN
201 XCA 2001 9.735251 NaN NaN NaN
202 XKO 2001 1122.205429 NaN NaN NaN
203 XNC 2001 17.661583 NaN NaN NaN
407 XAD 2002 0.507202 NaN NaN NaN
... ... ... ... ... ... ...
4372 XKO 2022 784.685260 NaN NaN NaN
4373 XNC 2022 527.159030 NaN NaN NaN
4565 XCA 2023 0.989440 NaN NaN NaN
4566 XKO 2023 1465.438575 NaN NaN NaN
4567 XNC 2023 41.029104 NaN NaN NaN

87 rows × 6 columns

nan_data['iso'].unique()
array(['XAD', 'XCA', 'XKO', 'XNC'], dtype=object)
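The same check can also be done at merge time. The following is a minimal sketch, assuming small stand-in tables (`forest_demo` and `lookup_demo` are hypothetical names with illustrative values, not the project's data). It uses the `indicator=True` option of `merge`, which adds a `_merge` column recording whether each row found a partner.

```python
import pandas as pd

# Toy stand-ins for the forest and continents tables (illustrative values only).
forest_demo = pd.DataFrame({
    "iso": ["AFG", "XKO", "ZWE", "XCA"],
    "year": [2001, 2001, 2001, 2001],
})
lookup_demo = pd.DataFrame({
    "alpha-3": ["AFG", "ZWE"],
    "country": ["Afghanistan", "Zimbabwe"],
})

# indicator=True adds a "_merge" column that says where each row came from.
checked = forest_demo.merge(
    lookup_demo, left_on="iso", right_on="alpha-3", how="left", indicator=True
)

# Rows marked "left_only" had no match in the lookup table.
unmatched_codes = sorted(checked.loc[checked["_merge"] == "left_only", "iso"].unique())
print(unmatched_codes)
```

Compared with filtering on NaN afterwards, this approach does not depend on which lookup column happens to be empty.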

According to ISO 3166-1, alpha-3 codes starting with ‘X’ are reserved for user-assigned purposes and do not officially represent recognized countries. Data providers use them for territories that lack an official code, so their meaning depends on the provider. In this analysis, I interpret the codes as follows:

  • XAD: Treated as Andorra.

  • XKO: Treated as Kosovo.

  • XNC: Treated as New Caledonia.

These interpretations are assumptions and should be checked against the data provider's code list, since Andorra and New Caledonia also have official ISO codes (AND and NCL). I could not identify what the code XCA refers to, so I have decided to drop it from the analysis.

update_values = {
    'XAD': {'country': 'Andorra', 'continent_region': 'Europe', 'continent_sub_region': 'Southern Europe'},
    'XKO': {'country': 'Kosovo', 'continent_region': 'Europe', 'continent_sub_region': 'Southern Europe'},
    'XNC': {'country': 'New Caledonia', 'continent_region': 'Oceania', 'continent_sub_region': 'Melanesia'}
}

# Fill in the country and region fields for each manually mapped code.
for iso, values in update_values.items():
    forest_region.loc[forest_region['iso'] == iso,
                      ['country', 'continent_region', 'continent_sub_region']] = list(values.values())
forest_region
iso year umd_tree_cover_loss__ha country continent_region continent_sub_region
0 AFG 2001 88.092712 Afghanistan Asia Southern Asia
1 AGO 2001 101220.621525 Angola Africa Sub-Saharan Africa
2 AIA 2001 3.878461 Anguilla Americas Latin America and the Caribbean
3 ALA 2001 396.934826 Åland Islands Europe Northern Europe
4 ALB 2001 3729.021031 Albania Europe Southern Europe
... ... ... ... ... ... ...
4566 XKO 2023 1465.438575 Kosovo Europe Southern Europe
4567 XNC 2023 41.029104 New Caledonia Oceania Melanesia
4568 ZAF 2023 29571.219239 South Africa Africa Sub-Saharan Africa
4569 ZMB 2023 190416.586825 Zambia Africa Sub-Saharan Africa
4570 ZWE 2023 5690.371581 Zimbabwe Africa Sub-Saharan Africa

4571 rows × 6 columns

forest_region = forest_region[forest_region['iso'] != 'XCA']
forest_region
iso year umd_tree_cover_loss__ha country continent_region continent_sub_region
0 AFG 2001 88.092712 Afghanistan Asia Southern Asia
1 AGO 2001 101220.621525 Angola Africa Sub-Saharan Africa
2 AIA 2001 3.878461 Anguilla Americas Latin America and the Caribbean
3 ALA 2001 396.934826 Åland Islands Europe Northern Europe
4 ALB 2001 3729.021031 Albania Europe Southern Europe
... ... ... ... ... ... ...
4566 XKO 2023 1465.438575 Kosovo Europe Southern Europe
4567 XNC 2023 41.029104 New Caledonia Oceania Melanesia
4568 ZAF 2023 29571.219239 South Africa Africa Sub-Saharan Africa
4569 ZMB 2023 190416.586825 Zambia Africa Sub-Saharan Africa
4570 ZWE 2023 5690.371581 Zimbabwe Africa Sub-Saharan Africa

4551 rows × 6 columns

forest_region[forest_region['continent_region'].isna()]
iso year umd_tree_cover_loss__ha country continent_region continent_sub_region

After verifying the result, we can confirm that there are no missing values left in the continent_region column. Every remaining entry now has country and regional information.

Step 5 - Emission Data

We are now ready to load the emission dataset for further analysis.

emission = pd.read_csv("energy.csv")
emission
Unnamed: 0 Country Energy_type Year Energy_consumption Energy_production GDP Population Energy_intensity_per_capita Energy_intensity_by_GDP CO2_emission
0 0 World all_energy_types 1980 292.899790 296.337228 27770.910281 4.298127e+06 68.145921 10.547000 4946.627130
1 1 World coal 1980 78.656134 80.114194 27770.910281 4.298127e+06 68.145921 10.547000 1409.790188
2 2 World natural_gas 1980 53.865223 54.761046 27770.910281 4.298127e+06 68.145921 10.547000 1081.593377
3 3 World petroleum_n_other_liquids 1980 132.064019 133.111109 27770.910281 4.298127e+06 68.145921 10.547000 2455.243565
4 4 World nuclear 1980 7.575700 7.575700 27770.910281 4.298127e+06 68.145921 10.547000 0.000000
... ... ... ... ... ... ... ... ... ... ... ...
55435 55435 Zimbabwe coal 2019 0.045064 0.075963 37.620400 1.465420e+04 11.508701 4.482962 4.586869
55436 55436 Zimbabwe natural_gas 2019 0.000000 0.000000 37.620400 1.465420e+04 11.508701 4.482962 0.000000
55437 55437 Zimbabwe petroleum_n_other_liquids 2019 0.055498 0.000000 37.620400 1.465420e+04 11.508701 4.482962 4.377890
55438 55438 Zimbabwe nuclear 2019 NaN NaN 37.620400 1.465420e+04 11.508701 4.482962 0.000000
55439 55439 Zimbabwe renewables_n_other 2019 0.068089 0.067499 37.620400 1.465420e+04 11.508701 4.482962 0.000000

55440 rows × 11 columns

emission.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55440 entries, 0 to 55439
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   Unnamed: 0                   55440 non-null  int64  
 1   Country                      55440 non-null  object 
 2   Energy_type                  55440 non-null  object 
 3   Year                         55440 non-null  int64  
 4   Energy_consumption           44287 non-null  float64
 5   Energy_production            44289 non-null  float64
 6   GDP                          40026 non-null  float64
 7   Population                   46014 non-null  float64
 8   Energy_intensity_per_capita  50358 non-null  float64
 9   Energy_intensity_by_GDP      50358 non-null  float64
 10  CO2_emission                 51614 non-null  float64
dtypes: float64(7), int64(2), object(2)
memory usage: 4.7+ MB

From the output of the info() function, we can see that the columns we intend to use (Country, CO2_emission, and Year) are stored with appropriate data types. Note that CO2_emission has missing values: 51,614 of the 55,440 rows are non-null. To ensure consistency and avoid issues when merging the data, we will apply some minor formatting by renaming:

  • Year to year

  • Country to country

This step ensures uniform column naming across datasets.

emission.rename(columns={
    'Year': 'year',
    'Country': 'country'
}, inplace=True)
emission
Unnamed: 0 country Energy_type year Energy_consumption Energy_production GDP Population Energy_intensity_per_capita Energy_intensity_by_GDP CO2_emission
0 0 World all_energy_types 1980 292.899790 296.337228 27770.910281 4.298127e+06 68.145921 10.547000 4946.627130
1 1 World coal 1980 78.656134 80.114194 27770.910281 4.298127e+06 68.145921 10.547000 1409.790188
2 2 World natural_gas 1980 53.865223 54.761046 27770.910281 4.298127e+06 68.145921 10.547000 1081.593377
3 3 World petroleum_n_other_liquids 1980 132.064019 133.111109 27770.910281 4.298127e+06 68.145921 10.547000 2455.243565
4 4 World nuclear 1980 7.575700 7.575700 27770.910281 4.298127e+06 68.145921 10.547000 0.000000
... ... ... ... ... ... ... ... ... ... ... ...
55435 55435 Zimbabwe coal 2019 0.045064 0.075963 37.620400 1.465420e+04 11.508701 4.482962 4.586869
55436 55436 Zimbabwe natural_gas 2019 0.000000 0.000000 37.620400 1.465420e+04 11.508701 4.482962 0.000000
55437 55437 Zimbabwe petroleum_n_other_liquids 2019 0.055498 0.000000 37.620400 1.465420e+04 11.508701 4.482962 4.377890
55438 55438 Zimbabwe nuclear 2019 NaN NaN 37.620400 1.465420e+04 11.508701 4.482962 0.000000
55439 55439 Zimbabwe renewables_n_other 2019 0.068089 0.067499 37.620400 1.465420e+04 11.508701 4.482962 0.000000

55440 rows × 11 columns
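Because the two datasets are later joined on country names rather than codes, it can help to compare the name sets first. Spellings that differ between sources would otherwise be lost without any warning in an inner join. The sketch below assumes two small hypothetical tables (`emission_demo` and `forest_demo`, with made-up spellings) and is not a record of what the real files contain.

```python
import pandas as pd

# Hypothetical mini-tables; the spellings are invented to illustrate a mismatch.
emission_demo = pd.DataFrame({
    "country": ["World", "Zimbabwe", "Vietnam"],
    "year": [2019, 2019, 2019],
})
forest_demo = pd.DataFrame({
    "country": ["Zimbabwe", "Viet Nam"],
    "year": [2019, 2019],
})

emission_names = set(emission_demo["country"])
forest_names = set(forest_demo["country"])

# Names that appear in only one of the two sources would be dropped by an inner join.
only_in_emission = sorted(emission_names - forest_names)
only_in_forest = sorted(forest_names - emission_names)

print("Only in emission data:", only_in_emission)
print("Only in forest data:", only_in_forest)
```

Aggregate entries such as World are expected to appear only on the emission side. Country names that show up on both lists under different spellings are the ones worth harmonizing.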

Step 6 - Final Data

data = pd.merge(emission, forest_region, on=["country", "year"], how='inner')
data
Unnamed: 0 country Energy_type year Energy_consumption Energy_production GDP Population Energy_intensity_per_capita Energy_intensity_by_GDP CO2_emission iso umd_tree_cover_loss__ha continent_region continent_sub_region
0 29112 Afghanistan all_energy_types 2001 0.015914 0.007509 19.4201 21607.0 0.736543 0.819486 1.153149 AFG 88.092712 Asia Southern Asia
1 29113 Afghanistan coal 2001 0.000542 0.000515 19.4201 21607.0 0.736543 0.819486 0.001944 AFG 88.092712 Asia Southern Asia
2 29114 Afghanistan natural_gas 2001 0.001849 0.001849 19.4201 21607.0 0.736543 0.819486 0.451205 AFG 88.092712 Asia Southern Asia
3 29115 Afghanistan petroleum_n_other_liquids 2001 0.008037 0.000000 19.4201 21607.0 0.736543 0.819486 0.700000 AFG 88.092712 Asia Southern Asia
4 29116 Afghanistan nuclear 2001 NaN NaN 19.4201 21607.0 0.736543 0.819486 0.000000 AFG 88.092712 Asia Southern Asia
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
19207 55435 Zimbabwe coal 2019 0.045064 0.075963 37.6204 14654.2 11.508701 4.482962 4.586869 ZWE 11553.329511 Africa Sub-Saharan Africa
19208 55436 Zimbabwe natural_gas 2019 0.000000 0.000000 37.6204 14654.2 11.508701 4.482962 0.000000 ZWE 11553.329511 Africa Sub-Saharan Africa
19209 55437 Zimbabwe petroleum_n_other_liquids 2019 0.055498 0.000000 37.6204 14654.2 11.508701 4.482962 4.377890 ZWE 11553.329511 Africa Sub-Saharan Africa
19210 55438 Zimbabwe nuclear 2019 NaN NaN 37.6204 14654.2 11.508701 4.482962 0.000000 ZWE 11553.329511 Africa Sub-Saharan Africa
19211 55439 Zimbabwe renewables_n_other 2019 0.068089 0.067499 37.6204 14654.2 11.508701 4.482962 0.000000 ZWE 11553.329511 Africa Sub-Saharan Africa

19212 rows × 15 columns

We have successfully joined the deforestation and emission datasets using an inner join on country and year. Only country-year pairs present in both datasets are retained, which limits the merged data to 2001 through 2019. The merged data is now ready for exploratory data analysis.
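One detail of the merged table is worth keeping in mind: the emission data has one row per Energy_type for each country and year, so each tree cover loss value is repeated on every one of those rows. In the rows displayed above, the all_energy_types value for World in 1980 equals the sum of the fuel-specific rows, which suggests it is a total. The sketch below is one possible way to keep a single row per country and year before merging. It uses hypothetical toy values, is not the approach taken in this notebook, and would change the sums reported later if applied.

```python
import pandas as pd

# Toy data with illustrative values only.
emission_demo = pd.DataFrame({
    "country": ["Zimbabwe"] * 3,
    "year": [2019] * 3,
    "Energy_type": ["all_energy_types", "coal", "petroleum_n_other_liquids"],
    "CO2_emission": [9.0, 4.5, 4.5],
})
forest_demo = pd.DataFrame({
    "country": ["Zimbabwe"],
    "year": [2019],
    "umd_tree_cover_loss__ha": [11553.0],
})

# Merging every energy type repeats the forest value on each row.
naive = emission_demo.merge(forest_demo, on=["country", "year"], how="inner")

# Keeping only the total row gives one row per country and year.
totals_only = emission_demo[emission_demo["Energy_type"] == "all_energy_types"]
merged_demo = totals_only.merge(
    forest_demo, on=["country", "year"], how="inner", validate="one_to_one"
)

print(len(naive), naive["umd_tree_cover_loss__ha"].sum())
print(len(merged_demo), merged_demo["umd_tree_cover_loss__ha"].sum())
```

The `validate="one_to_one"` argument makes pandas raise an error if either side has duplicate keys, which is a cheap guard against accidental row multiplication.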

Step 7 - Exploratory Data Analysis

Let’s begin by analyzing the trends in both deforestation and CO2 emissions across continental regions to find specific patterns.

forest_aggregate = data.groupby(['year', 'continent_region']).agg(
    {'umd_tree_cover_loss__ha': 'sum', 'CO2_emission': 'sum'}
).reset_index()

fig = px.line(
    forest_aggregate,
    x="year", 
    y="umd_tree_cover_loss__ha", 
    color="continent_region",
    labels={"year": "Year",
        "umd_tree_cover_loss__ha": "Tree Cover Loss (ha)",
        "continent_region": "Continent",
    },
    title="Deforestation by Region Over Time"
)
fig.show()

The chart highlights how deforestation varies across continents. The Americas show the highest levels of tree cover loss, with noticeable peaks in certain years that could reflect major deforestation events. Europe and Asia have moderate levels of deforestation, with some fluctuations over time. Africa shows a gradual but steady increase in tree cover loss, while Oceania remains relatively low throughout the period. These differences likely point to varying regional challenges and drivers behind deforestation.

fig = px.line(
    forest_aggregate,
    x="year", 
    y="CO2_emission", 
    color="continent_region",
    labels={"year": "Year",
        "CO2_emission": "CO2 Emission",
        "continent_region": "Continent",
    },
    title="CO2 Emission by Region Over Time"
)
fig.show()

Insights from the CO2 Emission by Region Over Time Chart

  1. Asia:
    Asia’s CO2 emissions have been on a steady rise, making it the biggest contributor among all regions. This isn’t surprising given the rapid industrialization and energy demands of many countries in the region. As economies grow, so does the reliance on energy-intensive industries, which is clearly reflected in the numbers.

  2. Americas:
    In the Americas, CO2 emissions have stayed relatively stable, with only small fluctuations over time. This seems to show a balance—on one hand, industrial activities remain significant, but on the other, there’s been progress in adopting renewable energy and other mitigation measures to keep emissions in check.

  3. Europe:
    Europe’s emissions have either held steady or started to decline slightly. This likely reflects the impact of strong environmental policies and the gradual shift toward cleaner energy sources like wind and solar. It’s an example of how focused efforts can make a difference over time.

  4. Africa:
    Emissions in Africa are still low compared to other regions, but they are slowly creeping up. This aligns with the ongoing industrialization and urbanization across the continent. As economies grow, energy use increases, which contributes to this upward trend.

  5. Oceania:
    Oceania continues to record the lowest emissions of all the regions, and the numbers haven’t changed much over time. This could be because the region has fewer energy-intensive industries and a smaller population compared to places like Asia or the Americas.

Looking at the charts, we can see some interesting patterns when comparing deforestation and CO2 emissions across regions. In the Americas, deforestation has stayed consistently high over the years, with noticeable peaks around 2010 and 2018. However, CO2 emissions in the Americas haven’t followed the same trend; they’ve remained fairly stable. This suggests that other factors, like industrial activities or transportation, might be playing a bigger role in driving emissions here than deforestation.

Asia tells a different story. CO2 emissions have been steadily climbing, and there’s a moderate but consistent trend of deforestation as well. This could mean that activities like industrial expansion or agriculture, which often lead to deforestation, are also directly contributing to the region’s rising emissions.

Europe shows yet another dynamic. There’s very little deforestation happening, but CO2 emissions remain high and steady. This suggests that land-use changes aren’t a major factor for emissions in Europe; instead, it’s likely industries and transportation that are driving their numbers.

In Africa, there’s a gradual increase in both deforestation and CO2 emissions. This makes sense given population growth and the expansion of agriculture, which often go hand in hand with deforestation. Lastly, Oceania stands out with minimal deforestation and a relatively small contribution to global CO2 emissions, reflecting its limited overall impact.

What these charts show is that the relationship between deforestation and CO2 emissions varies a lot by region. In places like Asia and Africa, there’s a stronger connection between the two, while in regions like Europe and the Americas, emissions seem to be driven by other factors. It’s a reminder of how different each region’s story is when it comes to environmental challenges.

fig = px.scatter(
    data,
    x="umd_tree_cover_loss__ha",
    y="CO2_emission",
    color="continent_region",
    trendline="ols",
    title="Correlation Between Deforestation and CO2 Emission",
    labels={
        "umd_tree_cover_loss__ha": "Tree Cover Loss (ha)",
        "CO2_emission": "CO2 Emission",
        "continent_region": "Continent"
    }
)

fig.show()

The scatterplot tells an interesting story about how deforestation and CO2 emissions relate to one another across different regions. In Asia, we see the largest spread of deforestation values, paired with the highest CO2 emissions. This makes sense given the region’s rapid industrial growth and large-scale land-use changes, which are major contributors to emissions.

The Americas also show a noticeable pattern. While deforestation values vary, CO2 emissions are clustered in the mid to high range. This suggests that deforestation, along with other industrial activities, plays a significant role in driving emissions in this region.

Europe, on the other hand, has a weaker connection between deforestation and emissions. Even as deforestation varies, CO2 emissions stay relatively steady. This might reflect Europe’s success in adopting cleaner energy and stronger environmental policies that reduce reliance on activities that drive emissions.

In Africa, both deforestation and CO2 emissions are relatively low. This lines up with the region’s slower pace of industrialization and smaller-scale land-use changes. Finally, Oceania has the lowest levels of both deforestation and CO2 emissions, emphasizing its smaller global footprint.

What stands out from the plot is how the relationship between deforestation and emissions changes depending on the region. In Asia and the Americas, the link is clear, while in Europe, Africa, and Oceania, other factors seem to have a stronger influence. This shows that tackling deforestation’s impact on emissions requires a tailored approach for each region.

Step 8 - Correlation between Variables

data['umd_tree_cover_loss__ha'].describe()
count    1.921200e+04
mean     1.133450e+05
std      4.514337e+05
min      0.000000e+00
25%      2.502050e+02
50%      5.715100e+03
75%      4.494546e+04
max      5.560431e+06
Name: umd_tree_cover_loss__ha, dtype: float64
data['CO2_emission'].describe()
count    19071.000000
mean        59.369662
std        414.280960
min         -0.000138
25%          0.000000
50%          0.190589
75%         10.969902
max      10732.002367
Name: CO2_emission, dtype: float64

Looking at the numbers, both CO2 emissions and tree cover loss have heavily skewed distributions, with a few observations far above the rest. For example, the maximum CO2 emission value is over 10,000, while the mean is only about 59 and the median is about 0.19. The mean is pulled up by these outliers, so it does not represent what is happening in most countries. The median is not affected by these extremes, making it a better choice for describing a typical country’s emissions and deforestation. So, to show a clearer picture at the country level, I use the median.
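The effect of a single extreme value on the mean can be seen in a small example. The numbers below are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Four small values and one extreme outlier (illustrative numbers only).
values = pd.Series([0.0, 0.2, 0.5, 1.0, 10000.0])

mean_value = values.mean()      # pulled far upward by the single outlier
median_value = values.median()  # the middle value, unaffected by the outlier

print(f"mean = {mean_value:.2f}, median = {median_value:.2f}")
```

Four of the five values are at most 1.0, yet the mean lands in the thousands, while the median stays at the middle value of 0.5.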

country_median = data.groupby('country').agg({
    'umd_tree_cover_loss__ha':'median',
    'CO2_emission':'median'
}).reset_index()
country_median
country umd_tree_cover_loss__ha CO2_emission
0 Afghanistan 97.749480 0.188230
1 Albania 1188.248196 0.085429
2 Algeria 5763.909736 1.657544
3 Angola 161718.969611 0.226583
4 Antigua and Barbuda 25.463260 0.000000
... ... ... ...
173 Vanuatu 377.560462 0.000000
174 Venezuela 101285.582103 21.393005
175 Vietnam 132423.224113 17.462579
176 Zambia 84831.582310 0.067808
177 Zimbabwe 9631.632143 0.950000

178 rows × 3 columns

fig = px.scatter(
    country_median,
    x='umd_tree_cover_loss__ha',
    y='CO2_emission',
    text='country',
    labels={
        'umd_tree_cover_loss__ha': 'Median Tree Cover Loss (ha)',
        'CO2_emission': 'Median CO2 Emission'},
    title='Median Tree Cover Loss vs Median CO2 Emission by Country 2001 - 2019')

fig.update_traces(marker=dict(size=10, opacity=0.7), textposition='top center')
fig.update_layout(
    title_font_size=16,
    xaxis_title='Median Tree Cover Loss (ha)',
    yaxis_title='Median CO2 Emission',
    template='plotly_white'
)
fig.show()

This chart gives a fascinating look at how tree cover loss and CO2 emissions compare across countries from 2001 to 2019. Starting with the United States, it’s clear that while the U.S. has high CO2 emissions, its tree cover loss is relatively low compared to countries like Russia or Brazil. This suggests that most of the U.S.’s emissions come from industries like energy and transportation rather than deforestation.

China tells a similar story. It has some of the highest CO2 emissions but very little tree cover loss. This reflects the country’s reliance on heavy industry and fossil fuels to drive its rapid economic growth. On the other hand, Russia stands out with significant tree cover loss and moderate emissions. This could be due to logging and land-use changes contributing to its numbers.

Brazil and Indonesia, however, are different. Both countries show high levels of tree cover loss, but their CO2 emissions are not as high as those of industrial giants like the U.S. or China. For Brazil, the Amazon’s deforestation is a major factor, while in Indonesia, activities like palm oil plantations play a big role.

Then there’s the cluster of countries near the lower end of the chart, where both tree cover loss and emissions are minimal. These are likely smaller or less industrialized nations with limited impact on global emissions.

What this chart really shows is how different each country’s story is. For some, emissions are driven by industrialization, while for others, it’s deforestation and land-use changes that make the difference. It’s a reminder that tackling emissions requires tailored solutions that consider each country’s unique challenges.

Step 9 - Correlation in Each Region

from scipy.stats import pearsonr

results = []
for region, group in forest_aggregate.groupby('continent_region'):
    corr, p_value = pearsonr(group['umd_tree_cover_loss__ha'], group['CO2_emission'])
    results.append({
        'continent_region': region,
        'correlation': corr,
        'p_value': p_value,
        'is_significant': 'Yes' if p_value < 0.05 else 'No'
    })


correlations = pd.DataFrame(results)
correlations
continent_region correlation p_value is_significant
0 Africa 0.905043 1.015996e-07 Yes
1 Americas -0.121429 6.204574e-01 No
2 Asia 0.766577 1.290445e-04 Yes
3 Europe -0.578872 9.407491e-03 Yes
4 Oceania 0.390274 9.854547e-02 No
correlations['significance_color'] = correlations['is_significant'].map({'Yes': 'green', 'No': 'red'})

fig = px.bar(
    correlations,
    x='continent_region',
    y='correlation',
    title="Correlation Between Tree Cover Loss and CO2 Emissions by Region (Significance Highlighted)",
    labels={
        "continent_region": "Region",
        "correlation": "Correlation Coefficient"
    },
    color='is_significant',
    text='correlation',
    # Pass the flag through px so each bar keeps its own value; color= splits
    # the bars into separate traces, so one shared customdata array would misalign.
    custom_data=['is_significant'],
    color_discrete_map={'Yes': 'green', 'No': 'red'}
)

fig.update_traces(
    texttemplate='%{text:.2f} (%{customdata[0]})',
    textposition='outside'
)

fig.update_layout(
    showlegend=True, 
    yaxis_title="Correlation Coefficient",
    legend_title="Significance",
)

fig.show()

The numbers tell an interesting story about how deforestation and CO2 emissions are connected in different regions. In Africa, the correlation is strong at 0.91, and it’s statistically significant, showing that deforestation has a clear and direct impact on emissions. Similarly, Asia has a strong and significant correlation of 0.77, reflecting how activities like land-use changes and industrial expansion drive emissions. Europe is quite different, with a negative correlation of -0.58, which is also significant. This likely shows the impact of successful policies like reforestation and the shift to cleaner energy. In the Americas, the correlation is weak at -0.12 and not significant, suggesting that emissions here are influenced more by industry and transportation rather than deforestation. Oceania also has a weak correlation of 0.39, and it’s not significant either, highlighting the region’s smaller role in these dynamics. These numbers really show how each region has its own unique relationship between deforestation and emissions.
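Each regional correlation above is computed from 19 yearly totals, and two series that both rise over time can correlate strongly even when their year-to-year movements are unrelated. One common robustness check is to correlate the year-over-year changes instead of the levels. The sketch below uses synthetic data (the column names and numbers are assumptions for illustration, not results from this project) to show how the two views can differ.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
years = np.arange(2001, 2020)

# Synthetic series (not the project's data): both rise over time, but their
# year-to-year noise is generated independently.
demo = pd.DataFrame({
    "year": years,
    "tree_cover_loss": 100 + 5 * (years - 2001) + rng.normal(0, 2, len(years)),
    "co2_emission": 50 + 3 * (years - 2001) + rng.normal(0, 2, len(years)),
})

# Correlation of the raw levels: dominated by the shared upward trend.
r_levels, p_levels = pearsonr(demo["tree_cover_loss"], demo["co2_emission"])

# Correlation of year-over-year changes: the shared trend is differenced away.
changes = demo[["tree_cover_loss", "co2_emission"]].diff().dropna()
r_changes, p_changes = pearsonr(changes["tree_cover_loss"], changes["co2_emission"])

print(f"levels:  r = {r_levels:.2f}")
print(f"changes: r = {r_changes:.2f}")
```

If a regional correlation stays strong after differencing, that is somewhat better evidence that the two series move together beyond a shared trend. It still does not establish that one causes the other.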

from scipy.stats import linregress

regions = forest_aggregate.groupby('continent_region')

simple_regression_results = []


for region, group in regions:
    slope, intercept, r_value, p_value, std_err = linregress(
        group['umd_tree_cover_loss__ha'], group['CO2_emission']
    )
    
    simple_regression_results.append({
        'region': region,
        'slope': slope,
        'intercept': intercept,
        'r_squared': r_value**2,
        'p_value': p_value,
        'significant': p_value < 0.05
    })

simple_regression_df = pd.DataFrame(simple_regression_results)
print(simple_regression_df)
     region     slope     intercept  r_squared       p_value  significant
0    Africa  0.000065   1540.419573   0.819104  1.015996e-07         True
1  Americas -0.000005  15763.538362   0.014745  6.204574e-01        False
2      Asia  0.001006   9540.867791   0.587640  1.290445e-04         True
3    Europe -0.000037  13189.972045   0.335093  9.407491e-03         True
4   Oceania  0.000009    854.346422   0.152314  9.854547e-02        False
fig = px.bar(
    simple_regression_df,
    x='region',
    y='r_squared',
    color='significant',  # Highlight statistically significant regions
    title="R-squared and Statistical Significance by Region",
    labels={'region': 'Region', 'r_squared': 'R-squared Value'},
    text='r_squared'
)
fig.update_traces(texttemplate='%{text:.2f}', textposition='outside')
fig.show()

This chart paints a clear picture of how deforestation and CO2 emissions are connected across regions.

  • Africa stands out with the highest R-squared value of 0.82, meaning deforestation explains a lot of the variation in emissions, and this relationship is statistically significant.

  • In Asia, the R-squared is 0.59, which is still significant and shows a strong link between deforestation and emissions, likely tied to land-use changes and industrial growth.

  • Europe is interesting because its R-squared is lower, at 0.34, but the connection is still significant, reflecting efforts like reforestation and cleaner energy.

  • For the Americas, the story is different. Its R-squared is barely 0.01, and the relationship isn’t significant, showing that emissions here are driven by other sectors like transportation and industry.

  • Oceania has an R-squared of 0.15, which is also not significant, pointing to a weaker connection.

This really shows how the role of deforestation in emissions varies widely depending on the region and its unique dynamics.
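For a regression with a single predictor, R-squared is the square of the Pearson correlation, and the slope test gives the same p-value as the correlation test. That is why the regression table repeats the p-values and significance flags from the correlation table. A minimal check on a small made-up dataset (the `x` and `y` values are illustrative only):

```python
import numpy as np
from scipy.stats import linregress, pearsonr

# Small made-up dataset for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

r, p_corr = pearsonr(x, y)
fit = linregress(x, y)

# With one predictor, R-squared is simply the squared correlation coefficient.
r_squared = fit.rvalue ** 2

print(f"r = {r:.4f}, r^2 = {r_squared:.4f}")
print(f"slope = {fit.slope:.2f}, intercept = {fit.intercept:.2f}")
print(f"p (correlation) = {p_corr:.4f}, p (slope) = {fit.pvalue:.4f}")
```

The regression therefore adds the slope and intercept but no independent evidence about the strength of the relationship.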

Final Summary

When I set out to explore how deforestation correlates with CO2 emissions globally, I wanted to understand whether cutting down forests, often for agriculture or industrial purposes, is truly contributing to climate change. Looking at the data, the connection is clear in some regions but less so in others, showing that the relationship between deforestation and emissions is more complex than I initially thought.

In regions like Africa and Asia, the story is more straightforward. Africa has the strongest link, with an R-squared of 0.82, meaning that yearly tree cover loss statistically accounts for about 82% of the year-to-year variation in the region’s CO2 emissions. This is consistent with land-use changes like agriculture being major contributors to emissions. Asia follows with an R-squared of 0.59, reflecting how industrial expansion and deforestation go hand in hand, fueling rising emissions. Both of these regions show statistically significant relationships, reinforcing the idea that deforestation is a key driver of CO2 emissions here.

In Europe, the connection is weaker, with an R-squared of 0.34, though still statistically significant. Because the underlying correlation is negative (-0.58), this suggests that while deforestation plays a role, policies like reforestation and sustainable land management have helped offset its impact on emissions. In contrast, the Americas and Oceania tell a very different story. The correlation is minimal, with R-squared values of 0.01 and 0.15, respectively, and neither is statistically significant. This points to other factors, like industrial and transportation emissions, playing a bigger role in these regions.

Deforestation is undoubtedly a driver of climate change, but its impact varies depending on the region. In places like Africa and Asia, where deforestation is closely tied to emissions, reducing forest loss could make a big difference in combating climate change. On the other hand, regions like Europe show how reforestation and strong environmental policies can weaken the link between deforestation and emissions. This journey from my initial question has shown me that tackling deforestation’s role in climate change requires region-specific strategies, balancing global solutions with local realities.

Collaboration and Sources

  • This project was completed independently.

  • I used ChatGPT to review my code, verify the logic, enhance data visualizations, and ensure the insights were clear and concise.